We will use a dataset that was used in the paper: I just ran two million regressions by Sala-I-Martin and Model uncertainty in cross country growth regression by Fernandez et. al. The dataset has 41 possible explanatory variables with 72 countries. The data consists of 43 columns with the Country, y (economic growth in per capita GDP) and 41 possible predictor variables.
We now conduct model selection by exhaustive search. Note that we have 2^41 which is approximately 2 trillion possible regressions to run. The leaps package has some smart ways to search over this space by avoiding visiting parts of the space where the optimum cannot exist. It employs a branch and bound algorithm to search more efficiently. This would take a few minutes to run on a laptop. The model shows the bias-variance trade-off. We can also plot the variables identified using the plot command.
library(leaps)
model1 <- regsubsets(y ~ .,data = eg1, nvmax = 41)
plot(summary(model1)$rsq)
plot(summary(model1)$adjr2)
which.max(summary(model1)$adjr2)
[1] 27
plot(model1, scale = c("r2"))
plot(model1, scale = c("adjr2"))
We next use the forward stepwise selection method which runs much faster as should be expected. Note that the results are not identical to what we obtained with the exhaustive selection approach.
model2 <- regsubsets(y ~ ., data = eg1, nvmax = 41, method = "forward")
plot(summary(model2)$rsq)
plot(summary(model2)$adjr2)
plot(model2, scale = c("r2"))
plot(model2, scale = c("adjr2"))
Which variables are selected:
summary(model1)$which[27,]
summary(model2)$which[27,]
The results indicate that with model 1, we have
1) EquipInv,
2) Confucian, EquipInv,
3) Buddha, Confucian, EquipInv,
4) YrsOpen, Confucian, Protestants, EquipInv
while for model 2, we have
1) EquipInv,
2) Confucian, EquipInv,
3) Buddha, Confucian, EquipInv,
4) Buddha, Protestants, EquipInv, Confucian
and so on. The results are different from the two models.
LASSO model: The results indicate that for variables such as EquipInv, YrsOpen and Confucian for many values of lambda, these occur while some other variables such as Abslat do not show up as often. Such results help illustrate the reliability of possible predictors for economic growth and can also cast doubts on the robustness of the results for certain variables which might be proposed as being correlated with growth.
model4$beta["EquipInv",]
s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11
0.00000000 0.02989896 0.05202483 0.06760673 0.08042867 0.09212352 0.10277946 0.11258867 0.12154191 0.12971174 0.13715581 0.14406270
s12 s13 s14 s15 s16 s17 s18 s19 s20 s21 s22 s23
0.15040049 0.15383321 0.15867593 0.16258346 0.16444915 0.16555791 0.16661977 0.16734255 0.16832148 0.16948025 0.17126770 0.17293008
s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35
0.17351524 0.17220683 0.16702669 0.16130824 0.15651196 0.15187566 0.14705527 0.14234527 0.13738774 0.13193100 0.12706968 0.12274067
s36 s37 s38 s39 s40 s41 s42 s43 s44 s45 s46 s47
0.11929709 0.11510558 0.11298039 0.11149575 0.11069796 0.10979416 0.10902316 0.10822285 0.10826583 0.10805193 0.10978025 0.11122019
s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 s58 s59
0.11265440 0.11364832 0.11454541 0.11517577 0.11571790 0.11637044 0.11769020 0.11837771 0.11911269 0.11981383 0.12046601 0.12107996
s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71
0.12163600 0.12214121 0.12259898 0.12310262 0.12363149 0.12418720 0.12446285 0.12470910 0.12491215 0.12509626 0.12525557 0.12540204
s72 s73 s74 s75 s76 s77 s78 s79 s80 s81 s82 s83
0.12553559 0.12566579 0.12577917 0.12588798 0.12597718 0.12607745 0.12615294 0.12619169 0.12622857 0.12626553 0.12629465 0.12632607
s84 s85
0.12635177 0.12637656